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OPTIMIZING THE ADAPTIVE FAST MULTIPOLE METHOD 

FOR FRACTAL SETS 
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Abstract 

We have performed a detailed analysis of the fast multipole method (FMM) in the adaptive case, in which 
the depth of the FMM tree is non-uniform. Previous works in this area have focused mostly on special types 
of adaptive distributions, for example when points accumulate on a 2D manifold or accumulate around a few 
points in space. Instead, we considered a more general situation in which fractal sets, e.g., Cantor sets and 
generalizations, are used to create adaptive sets of points. Such sets are characterized by their dimension, 
a number between 0 and 3. We introduced a mathematical framework to define a converging sequence of 
octrees, and based on that, demonstrated how to increase N —x oo. 

A new complexity analysis for the adaptive FMM is introduced. It is shown that the O(N) complexity 
is achievable for any distribution of particles, when a modified adaptive FMM is exploited. We analyzed 
how the FMM performs for fractal point distributions, and how optimal parameters can be picked, e.g., 
the criterion used to stop the subdivision of an FMM cell. A new subdividing double-threshold method is 
introduced, and better performance demonstrated. Parameters in the FMM are modeled as a function of 
particle distribution dimension, and the optimal values are obtained. A three dimensional kernel independent 
black box adaptive FMM is implemented and used for all calculations. 
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1 Introduction 

The N-body problem, notwithstanding its well established analytical difficulty jT2], has been studied exten¬ 
sively by means of numerical simulation. This problem has broad application in a number of fields such as 
celestial mechanics, molecular dynamics, fluid dynamics (e.g., vortex methods), solid mechanics (elasticity, 
cracks, dislocation dynamics, etc.), and plasma physics. More broadly, the problem of computing N 2 in¬ 
teractions among N points or N variables appears in the boundary element methods, problems involving 
radial basis functions (e.g., interpolation, meshless algorithms, etc.), or in probability theory to describe 
dense covariance matrices (e.g., seismic imaging, linear stochastic inversion, kriging, Kalman filters, etc.). 

The essence of all of these simulations is to compute particle-particle interactions. Considering N particles 
located at positions {a^}, the net contribution of these particles at some observation point y is calculated by 
a sum of the form: 

N 

f(y) =^2 K {xi,y)cri (i) 

i=i 

where K is some K 3 x R 3 4 R 3 function called the kernel, and a ; is the intensity of the i’th particle field 
(particle mass in the celestial mechanic example). The above formulation is essentially a matrix-vector 
multiplication, Act, where [A]ij = K(xi,yj). In general, the cost of this calculation is 0(N 2 ). However, the 
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fast multipole formulation introduced by Greengard and Rokhlin m allows a matrix-vector multiplication 
to be approximated with desired accuracy in 0(N ) time. This followed by several works to extend the 
algorithm for different kernels, analyze approximation, parallel implementation techniques, etc. We refer to 
some of the significant publications and our previous works 

An FMM matrix is a special class of "H 2 -matrices with analytical low-rank off-diagonal blocks. 'H 2 -matrix 
itself is a subclass of a larger category of hierarchical matrices called W . There are many fast linear algebra 
techniques for different classes of 'H-matrices [2]. As is explained in tj2j the main idea of fast linear algebra 
techniques for the hierarchical matrices is the low-rank approximation of interaction between well-separated 
clusters. In the context of celestial mechanics this idea translates to aggregating the effect of very far planets 
and computing their net gravitational force through a more efficient representation. 

The adaptive FMM refers to the case where the particle distribution, and its corresponding hierarchical 
tree, is not uniform. The extension of the uniform FMM is well explained in Email, followed by various 
aspects of its parallel implementation on different machines as discussed in [IS EE, and many other papers. 

Although the algorithm has been described several times in the past, previous papers have limited their 
analysis to very specific point distributions. We will detail this point shortly. The key point essentially is 
the manner in which points are distributed, in a non-uniform adaptive setting, as N goes to infinity. In the 
uniform case, the issue of increasing N presents no particular difficulty. We can simply increase the density 
of points uniformly and study how accuracy and parameters in the FMM are adjusted as a function of N. 

The non-uniform case however is more difficult. One essential point is describing the process of adding 
points so that N ^ oo. The adaptive test cases considered by most previous works fall broadly into the 
following categories: 

1. A small number of subregions are picked (e.g., n spheres) and points are progressively added to each 
subregion by distributing them with some smooth distribution (e.g., uniform, Gaussian, etc.) inside 
each region. Then the diameter and distance between regions are varied mmm- 

2. Manifolds are considered, that is surfaces or lines. Then, points are added on these manifolds again 
using a randomly uniform distribution eeem- 

3. Points are chosen such that they accumulate at some location, for example point i is chosen as Xi = 1/i 2 

0 [ll [231 [391 SO]. 

We note that complex cases have been considered such as in [TOJ [41] but in those particular cases N was 
fixed. 

All these cases represent only a small set of possible situations. There are many more ways to create 
non-uniform distribution of points. In this paper, we focus on the third case, in which points accumulate. 
However, we extend this situation to points that essentially accumulate at an infinite number of locations. 
This naturally leads to fractal sets. 

Fractal sets are encountered in several problems, notoriously when creating models of the universe [6, 
1251 [221 m (Ml [351 135] . Recently, antennas with fractal geometries E EH EH EH S3] have been studied. 
Their fractal shape allows a more compact design (taking advantage of the space-filling properties of fractal 
curves). Such contours are able to add more electrical length in less volume. In [33] , the method of moments 
and a variant of the fast multipole method m is used to study iterated Sierpinski microstrip patch antennas 
(Sierpinski pre-fractal sets). Although the literature on applying the fast multipole method to fractal sets 
is limited, we mention [26] . which analyzed fast Gauss transforms and related fast methods for non-uniform 
data sets, described using their lacunarity [2| (a concept related to the Hausdorff dimension for fractal sets). 

Fractal sets, in addition to their physical importance, are interesting from the numerical point of view. 
Some previous works have reported a strong correlation between the computational cost of the FMM, and the 
dimension of points, based on observations for limited cases with an integer dimension (1, 2, or 3) |I7] [20l 122] . 
Fractal sets can have any dimension between 0 and 3 (not only integers), while filling a three-dimensional 
box. This more general definition of dimension is known as the fractal dimension, box-counting dimension, or 
Hausdorff dimension [SUET] . Specifically in our numerical benchmarks, one of our main examples is a triple 
tensor product of the generalized Cantor sets, which provides all range of box-counting dimensions varying 
continuously from 0 to 3. Fractal sets, generally results in an adaptive FMM tree with leaf nodes sparsely 
distributed at different depths of the tree. Therefore, fractal sets with a continuously tunable dimension, 
are an interesting benchmark for the study of performance of the adaptive FMM codes. Furthermore, the 
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performance of an adaptive FMM algorithm applied to a regular low dimensional benchmarks (e.g., a 2D 
manifold in a 3D box) highly depends on the orientation of the set of points. In contrast, many fractal sets 
maintain their adaptive properties with rotation (i.e., they are more isotropic). This is the case for example 
when using tensor products of Cantor sets. 

In )J2] we briefly introduce the adaptive FMM algorithm, and set of notations and definitions that are 
used in the rest of the paper. It is known that the adaptive FMM algorithm maintains the 0(N ) complexity 
irrespective of the point distribution Gang. This requires a modification to the original FMM. In f|3j a 
new proof for the linear complexity of the adaptive FMM is introduced. We analyze the complexity of 
every interaction in the adaptive FMM step by step. This makes it apparent what modifications to the 
original FMM are required to ensure O(N) complexity for a general particle distribution. Moreover, the 
proof provides a heuristic for the reader about the effect of point distribution on the complexity of each part 
of the adaptive FMM. 

Using a C++ adaptive FMM code (which can be downloaded from stanford.edu/~hadip/AFMM.tar) 
based on the black-box algorithm of [16], the aforementioned fractal point distributions are studied, along 
with a detailed counting of the number of floating point operations. In f[4] the calculation begins with some 
standard cases (e.g., uniform, spiral, etc.), and evidence of how the point distribution changes the FMM cost 
and the optimal parameters is presented. The calculation is extended to general fractal sets afterwards. 

When studying the adaptive FMM, the process of increasing N should be considered rigorously. If we 
simply consider a sequence of points 2+ and define the set Sn = {xi}i=i,...,N, the sequence of sets Sn can 
only converge to a set with a countable number of points (by construction). However, we are interested 
to study the performance of the adaptive FMM for a sequence of sets that is converging to a fractal set 
(e.g., Cantor set). In ^5j we introduce the concept of super octree, and how to define a sequence of octrees 
converging to a super octree. Each super octree is associated with a set of points (e.g., a fractal set). This 
provides the required mathematical framework to study a sequence of finite adaptive FMM problems (with 
N —> oo), where point distributions have properties converging to the properties of some target infinite (and 
uncountable) set of points such as the Cantor set. In our numerical benchmarks we consider generalized 
Cantor sets that are constructed based on a recursive definition (find the code to generate points from 
stanford.edu/~hadip/CANTOR.tar). The points Xi are generated by going through k iterations of this 
recursive process. As k —> oo, N goes to infinity in a well-defined manner. 

In the first part of f[6j we propose a new strategy to build the adaptive tree. We focus on the criterion 
used to determine whether a cell needs to be further subdivided or not. The original bisection algorithm 
uses one threshold determined a priori (e.g., [Hum HU). Previous papers have also considered the median 
point, although that was mostly in the context of parallel implementations, which is also based on an a 
priori threshold [TB] - We are proposing to use two independent parameters simultaneously: the maximum 
number of points per leaf cluster, and the maximum number of levels (depth) Z max of the tree. Therefore, 
we need to solve an optimization problem to find the optimal parameters. 

Using two independent parameters turned out to be very important in order to reduce the computational 
cost. The main observation is that a leaf at level Z max generates mostly M2L operations, while leaves at 
levels lower than Z max also generate M2P and P2L operations. As we reduce the leaf particle threshold, we 
are replacing leaves at Z max by leaves at lower levels, which, in general, leads to fewer M2L and more M2P 
and P2L. These operators are not equivalent. M2L is pre-computed and can be accelerated using various 
techniques such as the fast Fourier transforms. In contrast, M2P and P2L are typically not precomputed 
(because this leads to a large memory storage) and are evaluated on the fly, as needed. They are consequently 
much more expensive. As our benchmark tests show, replacing M2L by M2P and P2L is disadvantageous. 
As a result, the particle threshold should be chosen very small so as to minimize the appearance of M2P 
and P2L. In particular, one outcome of this optimization is that the number of particles for clusters at Z max 
should be much larger than for leaf clusters at lower levels. This is in contrast with most adaptive FMM 
codes, that use a uniform threshold for all leaf nodes. These codes pay a high cost by generating too many 
expensive M2P and P2L operations. 

In our numerical benchmarks, we introduce a near-optimal solution to this optimization problem, and 
demonstrate a better performance comparing to the conventional scheme. In the case where a very cheap 
M2L is possible (e.g., using fast Fourier transforms), our near-optimal solution brings about an even more 
dramatic computational cost reduction. One of our conclusions is that using the X-list and W-list (defined 
in © in the adaptive FMM is not necessarily advantageous. 
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In section 6.2 we study how parameters in the FMM such as the optimum total number of levels and 
the maximum number of points per leaf cells can be optimized as a function of the dimension of the set. We 
consider dimensions ranging continuously from 1 to 3, and show a strong dependence of these parameters 
on the dimension. Other details of the distribution appear to be less important. Our analysis is based both 
on mathematical bounds and estimates, as well as numerical benchmarks and investigations. Theoretical 
estimates for the optimal parameters can be found for uniform distributions [22) . whereas for a generic 
adaptive distribution not much is known. Most implementations, if not all, manually or heuristically tune 
parameters to get the optimal values of parameters (e.g., 1). 

One can extend our analysis to design an automatic way of choosing the optimal parameters. For example, 
we can define local parameters for different parts of the domain (i.e., the maximum depth and threshold do 
not have to be globally defined). However, this is a separate work that is not addressed here. 


2 Adaptive Fast Multipole Method 

We briefly introduce the adaptive FMM in this section, and refer the reader to 03] for a detailed explanation. 
The fundamental idea in FMM is to avoid detailed computation among particles that are far from each other, 
and approximate these with some low rank operator. This reduces complexity from 0(N 2 ) to O(N). Hence, 
the main idea is to cluster points. In order to keep track of different clusters of particles we can use a tree 
data structure. In this section we explain and illustrate the scheme in a 2D domain for simplicity. Also, 
hereafter we assume observation points are the same as particle locations (i.e., y is one of the xfi s in Eq. §). 

Consider a set of N particles in a square domain, [0, if C R 2 . To build our hierarchical data structure of 
clustered particles, we start from the original box (which is a 1 x 1 square in our 2D example), and subdivide 
it into four identical sub-domains. We keep subdividing new boxes as long as the number of particles per box 
is greater than some threshold t , determined a priori. For instance, in Figure[l]the original box is subdivided 
up to 5 levels, assuming subdividing threshold is t = 8. The largest box corresponds to the root of the tree, 
which has four children associated with four sub-boxes. Each of these four children may have up to four new 
children, and so forth. Therefore, each box in the geometrical (physical) domain is corresponding to a node 
in the tree (this is defined below). Note that in a d-dimensional computation, each node has 2 d children, 
therefore, a 2 d -tree is the proper substitution for the quad-tree of the 2D case (e.g., octree if d = 3). 

Definition 1 (physical interval associated to a tree node) As described above, the physical cube associated 
to a node M in the octree is defined recursively as a l/2 d cube of the physical cube of M’s parent in the 

tree. We denote this cube by X(M) C [0, l\ d , which is a closed subset ofM. d . d € {1,2,3} is the topological 

dimension of the problem. 

Now, based on the FMM’s idea, we have to distribute the total computation amongst different nodes of 
the tree (i.e., each box in the domain). We will use the following definitions for the hierarchical tree similar 
to [1331 . 

Definition 2 (node relations in the tree) 

• Node A is adjacent to node B if 1(A) n 1(B) 0. 

• Node A is a parent of node B if B is child of A in the tree. We denote it by A = V(B). 

• Depth of a node A in the tree is defined recursively as 5(A) = S(V(A)) + 1. Depth of the root is 0. 

• Node A is a colleague of node B if they are adjacent and 6(A) = 5(B). 

• Node A is a leaf if it has no children. It corresponds to boxes with at most t particles. 

In the FMM, two sets of coefficients are defined for each node: multipole and local coefficients. Local 
coefficients consist of the contribution of all far particles to the particles within the node of interest. Multipole 
coefficients include the contribution of particles inside the node of interest to any far particle. The number 
of multipole and local coefficients assumed to be constant irrespective of the depth of the node in the tree. 
Considering a leaf cluster of particles, say C, we can divide its surrounding area into three regions: vicinity , 
separated , and far (Figure [2]). In the following these regions are defined. 
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Figure 1: A 2D non-uniform points distribution domain. The threshold for subdivision is t = 8. 



Figure 2: Vicinity and separated regions around a node in the FMM tree. 
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Definition 3 (regions around a cluster of particles) 

• The vicinity of a node C is defined as V(C) = U Q I (a), where a is a potential colleague of C, whether 
it exists in the tree or not (hence, the union is over 8 cubes in 2D, and 26 cubes in 3D). 

• Node a is separated from C if aC\ V(V(C)) ^ 0 and a fl V(C') = 0. 

• Node a is well-separated from C, if it is separated from C and 1 5(a) < 6(C). Therefore, If a is well- 
separated from C, then C is certainly separated from a. Note that in the context of the adaptive FMM, 
being well-separated is a one-way relation. 

• Node a is far from C, if it is neither in vicinity of C nor separated from C. 

The interaction of particles in a cluster C with the particles within its adjacent area is presumably not 
low-rank, and a direct particle-to-particle (P2P) calculation is required. The interaction with particles in the 
separated region, however, can be approximated with a low-rank operator, M2L (multipole to local). The 
particles lying in the far region, by construction, are within the separated region of C’s ancestors. Therefore, 
C inherits the far-field contribution from its ancestors through the L2L (local to local) operator, and transfers 
the effect of its local particles through the the M2M (multipole to multipole) operator to its ancestors. In 
order to implement this hierarchical data structure to transfer information among nodes, we introduce the 
following lists for every node C of the tree: 

Definition 4 (interaction lists of a node C in the tree) 

• U-List (only defined for leaf nodes): a £ U-List(C) if and only if a is a leaf node adjacent to C. 

• V-List: a £ V-List(C) if and only if both a and C are well-separated from each other. 

• W-List (only defined for leaf nodes): a £ W-List(C) if and only if C is well-separated from a, but a is 
not well-separated from C. 

• X-List: a £ X-List(C) if and only if C £ W-List(a). 

Remark 

C £ U-List (C). 

a £ U-List(C) implies that C £ U-List(a). 

a £ V-List(C) implies that 6(a) = 6(C). Also it implies that C £ V-List(a). 
a £ X-List(C) implies that a is a leaf (since it has a W-List). 

In Figure [3] the aforementioned lists are depicted for a leaf node of Figure [l] as an example. Essentially, 
the U-List is designed to include all the nodes within the adjacent region. Therefore, a direct interaction with 
nodes in the U-List is required. The concept of separation provides the necessary condition for a low-rank 
approximation. V-List, W-List, and X-List include nodes with low-rank interactions. Consider two nodes a 
and fl. When fl is separated from a , the effect of source points in node a to the observation points in node /3 
is approximated by a’s multipole coefficients. Also, the effect of source points of [3 to the observation points 
of a is added to the a’s local coefficients. 

According to the above explanation, the adaptive FMM algorithm uses the following 8 sub-algorithms to 
compute the effect of all source points on the all observation points. 

1. P2M: Aggregate particle information into the multipole coefficients. This is applied to all leaf nodes. 

2. M2M: Compute the contribution of multipoles of a node on the multipoles of its parent (use post-order 
tree traversal). 

3. M2L: Low-rank interaction between multipoles and locals of two separated nodes (use V-List). 

4. P2L: Low rank interaction between particles and local coefficients (use X-List). 
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5. L2L: Compute the contribution of locals of a node on the locals of its children (use pre-order tree 
traversal). 

6. M2P: Low-rank interaction between multipoles and particles (use W-List). 

7. P2P: Direct interaction between particles (use U-List). 

8. L2P: Interpolate local coefficients to particle location (applied to all leaf nodes). 
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Figure 3: Example of U, V, W, and X lists for a leaf node C in the example of Figure [T] 


3 Proof of linear complexity 

In the previous section, we discussed how the complete algorithm consists of eight sub-algorithms: P2M, 
M2M, M2L, L2L, L2P, P2L, M2P, and P2P. Here, we show that the adaptive fast multipole method has an 
O(N) operation count, where N is the number of particles. We assume observation and source points are 
the same, and are located in R 3 . Also we assume that every low-rank approximation has a constant rank, 
independent of the depth of the nodes. 

Note that for the adaptive case, the octree is not necessarily full (i.e., each node does not necessarily 
have 8 children filled with particles). Therefore, the depth of the octree for the adaptive case is not bounded 
by (D(\ogN), compared with the uniform case. Nabors et. al. have proposed a proof for linear complexity of 
the adaptive FMM [ 52 ] . [ 52 ] does not use the X and W lists. Instead, clusters interact only with clusters at 
the same level (with the exception of non-divided nodes, see [ 52 ] p. 721). [ 5 ] also provides a proof for linear 
complexity of the FMM independent of distribution of points. However, they do not discuss the role of the 
subdividing threshold, and the adaptive interaction lists X and W. In [38] the parallel implementation of 
their distribution independent algorithm is discussed. We present a new proof of the linear complexity with 
the usual set of interaction lists U, V, W, and X. This is done for each sub-algorithm separately. During the 
proof we introduce some modifications to the basic algorithm, which is necessary to get an O(N) complexity. 
Here, we assume a generic adaptive octree consists of N particles with a given subdividing threshold t , and 
prove the linear complexity step-by-step. 

Lemma 1 (P2M & L2P) In any adaptive FMM tree with N particles, there are exactly N P2M, and N L2P 
operations. 
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Figure 4: Adaptive binary tree (for the ID FMM). In the above tree, assuming that the subdividing threshold 
is 2, there is a total of 10 observation/source points. 


Proof The number of P2M (and similarly L2P) operations is independent of the particles distribution and 
subdividing threshold. Only the number of nrultipole/local coefficients per node (which depends on the 
algebraic low-rank approximation, e.g., Chebyshev polynomial interpolation) is relevant. Hence, there are a 
total of 2 N P2M and L2P operations regardless of particles distribution and the subdividing threshold. □ 

Lemma 2 (P2P) In any adaptive FMM octree with N particles, the number of P2P operations is O(N). 

Proof Define a directed graph G(V , E) as follows. There is a one-to-one mapping between the leaf nodes in 
the octree and the vertices in V. Then, there is an edge € E from vertex Vi to vertex Vj if and only if, 
node i is adjacent to node j in the octree, and the depth of node j is less than or equal to the depth of node 

i. 

In the octree, each node has at most 3 3 — 1 = 26 adjacent nodes at depth less than or equal to its own 
depth. Therefore, in the graph G , each vertex has at most 26 outgoing edges. This means that there are at 
most 26Ni edges in the graph, where Ni is the number of leaf nodes in the octree, and clearly, A) < N (in 
an average sense Ni ~ N/t). A P2P operation between two nodes takes 0(t 2 ) CPU cycles. Hence, the total 
number of CPU cycles associated to the P2P part of the algorithm is 0(Nit 2 ), which is G(N) for a given 
threshold t. □ 

For the rest of our complexity analysis, we need to define the concept of extended tree. This helps us to 
find appropriate upper bounds, as will be explained. 

Definition 5 (extended tree) The extended tree is obtained from a basic tree (Figure^ by performing the 
following modifications. 

• Denote the depth of the tree by l max . 

• For every non-leaf node consider all of its 8 children. Some of them are occupied with particles, and 
some of them are empty. We call the former an occupied node, and the latter an empty node. Add 
empty nodes to the tree. 

• Consider all leaf nodes with depth less than l ma x- They may be occupied or empty. Subdivide leaf nodes 
to 8 children, and continue this process until reaching the deepest level of the tree (i.e., l m ax)- 

Doing the above modifications, we end up with a tree whose leaves are all in the deepest level. Leaves are 
either occupied or empty. For instance, the binary tree with 10 particles shown in Figure^transforms to the 
extended binary tree in Figure [5[ 

Lemma 3 (M2P & P2L) In the extended adaptive FMM tree, there are no M2P and P2L operations. 

Proof By definition, in the extended octree all leaves lie in the deepest level, so no M2P and P2L is possible. 

□ 

Note that as we transform the basic octree to the extended octree, each M2P operation transforms to 
new M2L operations (at lease one new M2L). For instance, in Figure [5j the M2P operation between nodes 
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• Occupied node 
O Empty node 


depth =0 



depth =1 


depth =2 


depth =3 


Figure 5: Extended adaptive binary tree corresponding to Figure [4] By definition, in the extended tree all 
leaves lie in the deepest level. An example of an M2P operation transformed to a new M2L operation is 
depicted. 


V4 and V12, which exists in the basic tree of Figure |4j transforms to an M2L between nodes VIO and V12. 
Basically, in the tree extension algorithm, all M2P and P2L (which is the dual of M2P) operations transform 
to new M2L operations, such that the number of M2L operations in the extended tree is not less than the 
total number of M2Ls, M2Ps, and P2Ls in the original octree. Therefore, it is sufficient to show that number 
of M2L operations in the extended octree is linear with N. Note that the number of CPU cycles associated 
to each M2P and P2L operation is only function of t and r (the low-rank approximation), and the flops of 
each M2L is only function of r. Hence, assuming constant t and r, it is sufficient to count the total number 
of operations to show the linear complexity with respect to N. 

Definition 6 (divided node, parent, and children) 

• A divided node in the tree is an occupied node that has at least two occupied children. Similarly, a 
singleton node is defined as an occupied node that has exactly one child. Note that leaves and empty 
nodes are neither divided, nor singleton. 

• The divided parent of a node (3 is defined as its deepest ancestor who is a divided node or the root. 
For instance, in Figure^V 1 is the divided parent ofV 10. 

• The divided children of a given node (3 is defined as its least deep (shallowest) descendant which is a 
divided node or a leaf. 

Lemma 4 (number of divided nodes) For a generic adaptive extended octree T, the number of divided nodes 
is less than N. 

Proof We can show a stronger result by induction on the depth of the extended octree, l: 

E (« - !) = N l - 1 (2) 

vie occupied non-leaf nodes 

where, d(vi) is the number of occupied children Vi has, and N[ is the number of occupied leaves. 

The case l = 0 is trivial. Consider an extended octree with depth l. Note that all leaves in the extended 
octree lie in the deepest level. Remove the last (deepest) level of the tree. Call the remaining octree T', 
which is an extended octree with depth l — 1, and Nj occupied leaves. From the induction hypothesis the 
equality holds for V. Now, subdivide all leaves in T' to rebuild T. Empty leaves do not contribute to the 
above summation for both T and T'. Consider an occupied leaf v[ in T' that is subdivided to div'f) occupied 
leaves in T. Essentially, going from T' to T, we loose one leaf (v'f), and add d{v'f) new leaves. So the total 
number of leaves is increased by — 1 ) = Ni — N[. Add ^(d(i^) — 1) to the left hand side, and 

Ni — N{ to the right hand side of the equality for T' to obtain the equality for T. 

Now, note that d{vf) — 1 is non-zero for all divided nodes. So: 

Number of divided nodes < E (d(vi) — 1) = Ni — 1 < N — 1 (3) 

occupied non-leaf nodes 
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The last inequality holds, since each occupied leaf consists of at least one particle. □ 

Lemma 5 (M2L) In any adaptive FMM tree with N particles, the number of M2L operations is O(N). 

Proof To show linear complexity, we build the following bipartite graph. One set of nodes is the set of 
all M2L operations. The other set is the set of all divided tree nodes. We construct edges from each M2L 
operation to one or two divided tree nodes. We showed previously (Lemma [4]) that the number of divided 
nodes is 0{N) (in fact, it is 0(N{)). We will then bound the maximum number of edges incoming on each 
divided node. This proves that the maximum number of M2L operations is O(N). 

Consider two generic nodes a and (3. In a tree, there is a single path that connects a to f3 and that visits 
a node at most once. Edges in our bipartite graph connect an M2L operation between two generic nodes 
a and /3 to a node S such that S is the deepest divided node in the path from a to B including a 
and /3 themselves. Since the path is unique, there are at most two nodes S that are connected to an M2L 
operation. 

Consider for example Figure [5] The M2L operation between nodes VIO and V12 is connected to node 
V12, and the M2L operation between nodes V10 and V7 is connected to node V3. In some cases, the 
divided node S is not unique, e.g., the M2L between nodes V3 and V5 in Figure [5] In that case, the M2L 
is connected to both possible divided nodes. 

Now, we show that the maximum number of edges incoming on some divided node S is bounded by a 
constant. 

If the M2L operation between nodes a and (3 is mapped to S , by definition, at least one of a and (3 
should be a descendant of S, or S itself. 

If one of a or [3 is S , then there are at most 6 3 — 3 3 = 189 M2L operations of this type mapped to S, 
since this is the maximum size for the V-List of S. 

Consider now all M2L operations connected to S such that neither a nor [3 is S. Without loss of generality, 
assume that a is a descendant of S (if not true, then /? must be a descendant). Node a must belong to one 
of the children of 5, K\, I\ 2 , ..., or K$. 

For each K., , we now consider the maximum number of M2L such that a is a descendant of Ki or JQ 
itself. Given Ki , there are at most 215 not-far neighbors^ W,i, -W, 2 , • • •, Ni, 2 i 5 - At most 189 of them are 
well-separated from K t , and at most 26 are colleagues of Ki (they cannot be Ki itself because j3 cannot be 
in Ki at this point). The node f3 must be a descendant of one of these not far neighbors (this is true by 
definition of S). 

We now show that for any pair (. Ki,N it j ), there is at most one set {a, f3} such that: a is Ki or its 
descendant, (3 is Nij or its descendant, and there is an M2L between a and (3 that is mapped to S. If we 
prove this, we will have established that there cannot be more than 189 + 8 x 215 = 1909 incoming edges on 
S. 

For this final point, there are now only 2 cases. If N t j is well-separated from Ki (i.e., N,^ is in the V-List 
of Ki) then we must have a = Ki and /3 = Nj. In this case, we are guaranteed that there is no more M2L 
between descendants of a and (3. The set {a,/3}, if it exists, must be unique. 

If Nij is a colleague of K, and the M2L operation {a,/3} exists, both N,j and K, must be singleton 
nodes (otherwise, the M2L would not be mapped to S). Because of the definition of S, we further have that 
the two subtrees starting at Ki and Wj must be branches of singleton nodes, at least until a and f3 are 
reached. Moreover, no other M2L interaction can exist along these two branches. Essentially to find a and 
/3, we simply go down the tree starting at K, and Nij until we find two nodes that are well-separated. These 
must be a and f3 and, consequently, the set {a, (3} is unique. □ 

So far, we have discussed the complexity of six out of eight operations of the adaptive FMM. However, 
without the following modification, the basic adaptive FMM is not O(N). In fact, for a given N , it can 
be arbitrarily large compared with N. Consider a very long branch of singleton nodes ending at a leaf (see 
Example [4]). This branch of tree may be generated only due to one particle, and there is no limit on its 
depth. Such cases also have been considered in [3j. Consequently, performing the basic M2M (and similarly 

1 A not-far neighbor of a node C is a node with the same depth, which is either its colleague or a child of its parent’s 
colleagues. This is basically the union of the colleagues and the V-List nodes (Figure [2j. Therefore, each node has at most 
6 3 — 1 = 215 not-far neighbors. Note that C or any of its descendants may only have M2L interactions with one of C’s not-far 
neighbors and their descendants. 
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L2L) operator on this branch, we could add infinitely large cost to the algorithm. This is why we have to 
modify the algorithm. This modification is similar to the modification introduced in [3]. 

Definition 7 (modified adaptive FMM) In the modified adaptive FMM algorithm we do not consider mul¬ 
tipole and local coefficients for singleton nodes. Specifically, for M2M, we pass the multipole coefficients of a 
non-singleton node directly to its divided parent (as opposed to passing the multipole coefficients step-by-step 
through all the singleton ancestors). Similarly for L2L, we pass the local coefficients of a divided node di¬ 
rectly to its divided children. For example, in the binary tree of Figure^ the multipoles of node V10 directly 
contribute to its divided parent VI. 

Remark M2L operations for singleton nodes, which in the modified algorithm do not have multipole and 
local coefficients anymore, are being taken care by their divided children. Moreover, going from the basic 
algorithm to this modified algorithm, the number of M2L operations does not change. Therefore, the proof 
for the M2L still holds. 

Lemma 6 (M2M & L2L) In the modified adaptive FMM algorithm with N paHicles, the number of M2M 
and L2L operations is O(N). 

Proof As explained, in the modified adaptive FMM, M2M and L2L operate between a divided node and 
one of its non-singleton children. From lemma [4j we have a bound on these pairs: 

Number of divided nodes < ^ (d(vi) — 1) = Ni — 1 < N 

Vi£ occupied non-leaf nodes 

=> ^ d(vi) = I ^ (d(vi) — 1) I + Number of divided nodes < 2 N 

Vi£ divided nodes divided nodes / 

This shows that the number of pairs of nodes that require M2M and L2L operators is bounded by 2 N (in 
fact, it is bounded by 2N{). □ 

4 Numerical consideration 

In £[3J we established an upper bound for each sub-algorithm of the adaptive FMM. In this section, we 
compute the CPU cylce count of the algorithm, and then show some numerical results to verify the linear 
complexity. We chose a low-rank approximation based on the Chebyshev polynomial interpolation introduced 
in pTO] . 

4.1 CPU cycle count 

Using the scheme described in m, we assumed r Chebyshev points in each direction. Hence, for a 3D calcu¬ 
lation, each node has r 3 multipole and local coefficients. Let’s assume each kernel evaluation takes at most k 
cycles. The aforementioned sub-algorithm operators lead to many small matrix-vector multiplications. We 
assume that the pre-computation of M2L, M2M, and L2L matrices is done before the actual calculation (the 
entries of these matrices are independent of the particle positions). In the current black-box adaptive FMM 
implementation, one application of each operation to an octree node has the following cycle count: 

P2M Ci + C 2 r 4 

M2M C 3 + C 4 r 6 

M2L C 5 + C 6 r 6 

P2L C 7 + C s kr 3 

L2L C 9 + Ci 0 r 6 

L2P Cn + C i2 r 4 

M2P Ci 3 + C 14 fcr 3 

P2P Cis + C 16 k 

where, in the above expressions all Cf s are constants that depend on machine architecture, particle distribu¬ 
tion, subdividing threshold, etc. Note that this cost can be reduced using different methods. For example, 
fast Fourier transforms can be used to reduce the M2L operator to 0(r 3 logr). 


11 


4.2 Numerical results 


In this section, numerical results for a black-box adaptive FMM calculation for various particle distributions 
are presented. Here, we have used r = 4 Chebyshev points in each direction (i.e., a total of 64 points) 
for the low-rank approximation. In Figure [6j the particle clustering for the uniform and spiral distribution 
is depicted. Note that the corresponding tree for the spiral case is not fully adaptive in the sense that it 
has only one concentration point as N —> oo. In Figure [TJ the total FMM cost (in giga cycles) is plotted 
versus the number of points. It is clear that the adaptive FMM has linear complexity, as opposed to the 
direct matrix-vector multiplication that has 0(N 2 ) complexity. We have assumed each kernel evaluation 
takes 19 cycles, each division or square root takes 4 cycles, and each addition, multiplication, subtraction, or 
branching is assumed to take 1 cycle. We count cycles as the code is running. Note that in this analysis we 
are not taking to the account the cost of memory access. Memory access optimization is highly architecture 
dependent. 




Figure 6: Three dimensional adaptive FMM clusters of 1,000 particles with a subdividing threshold t = 10, 
for two different point distributions. Left: uniform; Right: spiral. 



Figure 7: Adaptive FMM total cost (in giga CPU cycles) as a function of the total number of particles, N, 
for the spiral case. 

The plot in Figure [7] is obtained after optimizing the subdividing threshold t. Basically, for each run, we 
have to tune t to get the optimum cost. For a more comprehensive understanding of the optimization, in 
Figure [8] we have illustrated the variation of the optimum threshold versus N for two cases. The optimum 
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threshold is not a unique number. In fact, there is an optimum interval rather than a single number. Any 
choice of t within the optimum interval gives rise to the same adaptive tree with maximum height l op t.■ This 
is the key parameter, which allows minimizing the computational cost. We will discuss this more in fj6| 

The extrema of the intervals in Figure [8] lie along lines with different slopes. The ratio of slopes for 
consecutive lines is a constant number. This ratio depends on the distribution of particles. For instance, in 
the uniform case this ratio is 2 3 . However, for the spiral case it is 2 d , where d < 3. Each group of intervals 
in Figure [8] share the same l op t- 

For each distribution, there are three key values related to the threshold: t m in, tcut , and t max , which are 
illustrated in the right plot in Figure [8j f TO j„ is the minimum acceptable threshold to get the optimum cost 
among different values of N. Similarly, t max is the maximum acceptable optimum threshold. t cut belongs to 
the intersection of all of the optimum threshold intervals. t cu t can also be interpreted as the maximum (over 
different values of N) of the beginning point of the optimum threshold intervals. The optimum threshold 
interval for each l opt starts from f m j„, and ends at t max - The optimum interval shifts as we increase N; 
however, as soon as t cut lies outside the optimum interval, l opt increases by 1, and another group of intervals 
forms. Similar to the slope of bounding lines, we have t max /t cut = t cut /t m i n = 2 d . 

This new quantity, d, determines the behavior of the optimal threshold interval as N increases. For a 
uniform distribution of points d is 3, and for a surface of points in space d is 2. It can take various (and not 
necessarily integer) values for different point distributions. Therefore, we have to look after a generalized 
form of dimension that is not restricted to integer numbers to be able to categorize different non-uniform 
distributions. Parameter optimization is possible afterwards. This is discussed in the next sections. 



Figure 8: Optimum subdividing threshold interval in the adaptive FMM as a function of N for two particle 
distributions: Left: uniform; right: spiral. 


5 FMM and fractal sets 

In (j4j we illustrated how the study of different adaptive trees requires a generalized definition of dimension. 
Essentially, what governs the behavior observed in Figure [8] is the number of occupied nodes in level l of the 
tree. For instance, for a full octree, there are always 2 dl occupied nodes in level l of the tree, where d = 3. 
In general, we are interested in the statistics of the number of occupied nodes in level l of the octree. This 
is determined by the average number of children of a node. In order to formally define the average number 
of children of each node, we need to establish a strategy to increase the number of points (i.e., N —> oo). As 
N goes to oo, its corresponding octree also grows. Therefore, we can look at the average number of children 
of nodes with depth l as l —> oo. 

There is a theoretical difficulty regarding taking the limit of a set of N points as N goes to infinity. If we 
simply consider a sequence of points Xi, and define the set S'jv = {a:i}i=i 1 ... j Ar, the sequence of sets Sn can 
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only converge to a set with a countable number of points. This won’t work for our purpose. The Cantor set 
for example is uncountable. Therefore a different definition of the limit must be considered. 

It turns out that it is easier and intuitively simpler to consider the limit of a sequence of octrees. To do 
this, we need to extend the definition of octree to trees with infinite levels as is defined below. 

Definition 8 (super octree) A super octree is a labeled tree with infinite levels, where every node has exactly 
eight children. Each node is labeled as occupied or empty. A node is occupied only if its parent is occupied. 

A basic octree can be extended to a super octree (by infinitely subdividing) similar to the concept of 
extended tree introduced in fj3] Now, for a given super octree we can formally define the average number of 
children of nodes. 


Definition 9 (occupancy) The occupancy of a super octree T is defined as: 


d ocp (T) = lim 

l —>00 


log 2 (Af ocp ,;) 

l 


(4) 


where N ocP: i is the number of occupied nodes at level l. 

The above definition suggests that in level l of the super octree, on average, 2 dcopl out of 2 dl nodes are 
occupied. However, the above limit does not necessarily exist for a generic super octree. We are going to 
introduce a more general notion of dimension for super octrees that is always defined. To do this, we will 
associate a set of points in R 3 to each super octree. The dimension of the super octree is then defined based 
on the set associated to it. In the following three definitions, the set of points associated with each super 
octree is defined step by step. 


Definition 10 (path) A path V is an infinite sequence of nodes {W} starting from the root of the tree (i.e., 
N 0 is root), where N i+ 1 is a child of N ; for i > 0. An occupied path V is a path that only consists of 
occupied nodes. 


Definition 11 (point associated to a path) LetV be a path consisting of a sequence of nodes {W}, then this 
path is uniquely mapped to a point pi(P) £ [0, l] 3 = f] i T(N i ). 

Note that the intersection of nested closed sets is always non-empty. Since I(Ni)’s are three-dimensional 
boxes with decreasing edge length 2~ s ( Ni ' ) = 2 ~ l , their intersection is a single point. 

Definition 12 (associated set) The associated set of a super octree T is defined as S(T) C [0, l] 3 = 
Up P^Pi)> where Vi is an occupied path in T. 

Conversely, for a given set of points ICK 3 the associated super octree ifH can be created by performing 
the usual adaptive FMM subdividing process. X should be first mapped to [0, l] 3 , and then the subdividing 
process is applied. However, some points may lie exactly on the midpoint of the interval in the subdividing 
process (e.g., points in [0, l] 3 with a finite binary representation). To uniquely define the associated super 
octree, we can always pick the left (or always the right) interval when subdividing, in the case of midpoints. 
Hence, S is a pseudo inverse for S ^ (i.e., S(SA^)) = X). 

Now, we can define the dimension of a generic super octree T as the dimension of its associated set, S(T). 
But, what definition of dimension is the best choice? Let’s begin with the Hausdorff dimension , which is a 
generalized form of dimension. It is defined in two steps as follows m- 

Definition 13 (Hausdorff measure) The d-dimensional Hausdorff measure, pd, of set of points X for any 
d in Rq = [0,oo) is defined as follows: 

Pd(X) = lim inf (diam(Ui)) d , (5) 

e->n UiSU ' 

where the infimum is taken over all countable covers lA t = of X such that diam(Ui) < e for all i. 
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Lemma 7 (Hausdorff dimension) For every bounded set X in a given metric space, there is a unique value 
dn{X) £ U { 00 } such that pd'(X) = 0 if d! > dn(X) and pd'{X) =00 if d! < dn(X). dn{X) is called 
the Hausdorff dimension of X. 


For example, the Hausdorff dimension of any countable set is 0, and the Hausdorff dimension of R n , 
where n £ N, is n. 

We will need the following lemma to prove Corollary [9] 

Lemma 8 (Hausdorff measure for super octrees) Let T be a super octree. Then: 

Td{S{T)) = lim inf £ 2~ ds ™ ( 6 ) 

where the infimum is taken over all countable set of occupied nodes Mi = {iVj}ieN 0 / T, such that 5(Nf) > l 
for all i, and {I(JVj)} a cover for S(T). Note that length of the edges of the cube T^Nf) is 2~ 5 ^ Ni \ 
where S(Nf) denotes the depth of the node Ni in T. 

The next corollary shows that the Hausdorff dimension is always a lower bound for the occupancy (when 
the latter is defined). 


Corollary 9 For any super octree T: 
assuming the existence ofd ocp {T). 


d H (S(T )) < d OC p(T) 


(7) 


Proof If S(T) is a finite set of points, dn(S(T)) = d ocp {T) = 0. So, let’s assume S(T) includes an infinite 
number of points. From the definition of infimum, for any level l: 


N ocPtl 2~ dl = ^2~ dl >inf E 2 


—d5(Ni) 


Ni 


NiZJV, 


where s are all the occupied nodes of level l of the tree. Using Lemma [ 8 ]: 

lim N ocp i 2~ dl > lim inf V 2~ dS ^ = p d (S (T)) 

l—} OO A f, ^ J 


l —^OO J\fi 


NieAfi 


lo S2 ( N ocp,l) 

lim N ocPji 2 docp(T) > n d (5 (T)) 

l—> OO 


1 - 


lim N}„„ > p d (S (T)) 


l—> OO 


y ocp,l 


Since N ocp j -> 00 as l —y 00 , we can conclude: 

p d {S (T)) =0 for d > d ocp (T) 

Hence, based on Lemma[7j dn ( S (T)) < d ocp (T). □ 

Corollary|9]provides a lower bound for our parameter of interest d ocp (T). What we described as occupancy 
is in fact equivalent to another notion of dimension called box counting dimension or capacity. The definition 
of capacity is similar to the Hausdorff dimension except that it only allows coverings with boxes of the same 
size. For uniformly self-similar sets, the Hausdorff dimension and capacity are numerically equal (see [24|). 
The numerical equality is also shown for a broader family of sets, namely multi-scale fractals (see (25]). 
However, for a general set, the Hausdorff dimension and capacity are not equal. In fact, there are many sets 
where capacity is not defined while the Hausdorff dimension is always defined 132 ■ The next three examples 
show different possible scenarios. The results below are given without proof. 


Example 1 Let X be the set of all rational points in [0,1]. Since X is a countable set, the Hausdorff 
dimension is 0. However, since X is dense, the super binary tree corresponding to X is full (i.e., all nodes 
are occupied); therefore, its occupancy (or box counting dimension) is 1. 
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Example 2 Let X be the generalized Cantor set defined as follows. Start with segment [0,1], remove an 
open segment with length 7 £ (0,1) from the middle. In the next step, remove the middle segments of the two 
new subintervals with length ratio 7 , and continue. This is illustrated in Figure [5] The famous middle-third 
Cantor set corresponds to the case 7 = 1/3. Assume X corresponds to a super octree T (i.e., S(T) = X). 
For the generalized Cantor set, the Hausdorff dimension and occupancy are equal: 

d H (X) = d ocp (T) = -- l0 f gL (8) 

log(V) 


Figure 9: Generalized Cantor set 


Example 3 Let’s reconstruct X in the Example^ with the sequence of scales 0 < 7 , < 1. At step i, replace 
all intervals with two subintervals by removing a segment of length ratio 7 j. If 70 = 71 = 72 = • • ■ = 7 , the 
generalized Cantor set results. However, if the sequence { 7 ^} does not converge, the box-counting dimension 
is not defined, while the Hausdorff dimension is always well-defined JSTjj. 

Now, we enter a key part in our formal definitions. The above framework enables us to analyze the 
dimension of super octrees associated to infinite number of points. In practice, we need to work with a finite 
set consisting of N points, and study the point distribution effect as N —> 00 . If we consider a sequence of 
finite sets consisting N points, Sn, as N —» 00 we can only converge to a set with a countable number of 
points. Instead, we will consider a converging sequence of octrees. For example, consider the interval (0,1). 
This is an uncountable set. The super octree for this set is full. We can consider the sequence of octrees Ti 
obtained by simply keeping all the nodes in the super octree down to level i. As i goes to infinity, we are 
correctly modeling N —> 00 for a uniform distribution of points in the FMM. Taking the limit of a sequence 
of octrees is therefore the correct concept for our purpose. 

The overall process is then as follows. We consider an infinite set of points, such as the Cantor set. Then, 
we consider the associated super octree T. We build a sequence of octrees Ti that converges to T. For each 
Ti, we can consider an appropriate distribution of points such that the octree for that set is exactly Ti. As 
i goes to infinity, the number of points in the FMM also goes to infinity. The convergence of the sequence 
of octrees is defined as follows. 

Definition 14 (converging sequence of octrees) A sequence of octrees {Ti} converges to a super octree T s , 
if for any l > 0 there is ni > 0 such that if i > ni, then Ti’s are identical from the root up to level l. 

Remark If cj = k ' S2 j ocp - l , where N ocPt i is the number of occupied nodes at level l of octree Ti with i > ni, 
then limj^oo C; = d ocp {T s ) upon existence of d ocp (T s ). 

Note that the details of the point distribution is not important as long as the sequence of octrees is correct 
(i.e, converges to the desired super octree). This is the main data structure that determines the complexity 
of the FMM. For example, the location of a point in a leaf is irrelevant as far as the computational cost of the 
method goes. From there, depending on the dimension, we will be able to determine the optimal parameters 
for the FMM as well as discuss the running time of the algorithm. 

For the purpose of the adaptive FMM study, we are looking for a set of points, X, leading to interesting 
FMM trees. The generalized Cantor set provides such properties for us. Generalized Cantor sets in R” are 
uncountable , perfect, and nowhere dense. Recall that a perfect set is a closed set with no isolated points, 
and a set that is nowhere dense is a set whose closure has an empty interior. The interpretation of these 
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properties in the context of adaptive FMM are explained in Lemmas 10 and 11 Moreover, the family of 
generalized Cantor set provides the full range of Hausdorff dimensions, from 0 to 3. 

Let 0^ to be a one-dimensional generalized Cantor set with parameter 7 £ (0,1). Each point in this set 
can be described as follows: 


X £ Cry 


= dib 1 where di £ { 0 , a} with a = 




i =0 


We can map C 7 to [0,1] using following surjection: 


f(x) = fC£dib i ) = J2i 2 ~ i 


i—0 i =0 

This mapping readily shows why Cantor set is uncountable, perfect, and nowhere dense. 

Remark An n-dimensional (topologically) generalized Cantor set can be obtained by n times direct product 
of a one-dimensional (A with itself. Therefore, for the three-dimensional case: 


Cry X Cry X Cry C [(), l] 3 UlM rf# {Cry X C ry X Cry) = ~ 3 


log( 2 ) 

log(V) 


Lemma 10 (octree of a perfect set) Let X C [0, l ] 3 be a perfect set. In the octree for the adaptive FMM on 
X, there is no occupied node with a finite number of particles in it. 

Lemma 11 (octree of a nowhere dense set) Let X C [0,1] 3 be a nowhere dense set. In the octree for the 
adaptive FMM on X, there is no full subtree (i.e., a subtree whose nodes are all occupied). 


Before moving on to the next section, and study the role of dn in the performance of the adaptive FMM, 
let’s present one dramatic example to demonstrate why the modified adaptive FMM algorithm introduced in 
0 is necessary. 

Example 4 Define X C [0,1] as follows. Start with [0,1]. At step i consider the current set of intervals. 
Take each interval and subdivide it into two subintervals. Take each interval and reduce its length by 2 2 , 
that is, if the interval starts from a and has length l, the scaled interval starts from a, and has length -Ay . 
The intersection of all of these intervals is X. Basically: 


x £ X <*=> x’s binary form is x = O.dic^dadi ... 
where d\ £ {0,1}, <^2 = ^3 = 0, d± £ {0,1 }, d$ = de = dr = d$ = 0, dg £ {0,1 }, ... 

We can show that X is uncountable, perfect, and nowhere dense. 

Now, the arbitrarily long sequences of 0 ’s guarantees the existence of arbitrarily long branches of singleton 
nodes in the associated adaptive binary tree. Hence, the modified adaptive FMM algorithm is necessary if 
one wants to get linear complexity for such a set. Indeed, if we consider the intervals created at step i, we 
have N = 2* such intervals (assume that we create a set with 2 l points by picking a random point in each 
of the intervals obtained at step i). But, we have fl(N) singleton branches, each of size D(N). Therefore, 
without trimming the singleton branches, the overall cost of the FMM is Ll(N 2 ) in that example. 


6 Optimal choice of the FMM parameters 

In this section, we are going to use the generalized Cantor set introduced in S|5] to study the computational 
cost of the FMM. The same implementation as described in 0 is used. 

In the previous section, we explained how to construct a sequence of octrees that is converging to the 
desired super octree. Here, we consider the super octree associated with the generalized Cantor set C 7 . To 
construct a finite set of points, we truncate the number of levels to a finite number, and pick one point in 
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each leaf node of the resulting tree. For example, in Figure [9j we can pick 16 points from intervals resulted 
at step 4. Figure [TO] shows the clusters of the octree corresponding to a subset of Ci with 4096 points. 

Earlier in the paper, we discussed the subdividing threshold t. This number can take any value between 
1 and N. The former results in a calculation with highly clustered particles, whereas the latter brings about 
direct 0(N 2 ) computation. Its optimum value depends on many parameters such as machine architecture, 
point distribution, low rank approximation order, etc. In [44] . the optimum threshold is picked based on the 
low rank approximation order. Chandramowlishwaran et al. [3] showed how the total cost varies with the 
choice of threshold, and picked the optimum value by tuning. Gumerov et al. |22| discussed the analytical 
optimal value of the subdividing threshold for the uniform distribution of particles. 

We will assume a constant low-rank approximation order (the number of Chebyshev points in our case), 
and study the variation of the total cost, and the optimum threshold as a function of the Hausdorff dimension 
for the generalized Cantor sets. 



Figure 10: Three dimensional Cantor set FMM tree, for N = 4096, t = 1, and dn = 2. For visualization 
purpose it is colored by [R,G,B]=[a;, y, z\. 


6.1 New double-threshold method 

The subdivision threshold t is not the only important parameter for optimizing the computational cost in 
FMM. There are in fact two separate criteria that can be used. One is t, and the other is the maximum 
number of levels (i.e., the maximum depth of the tree). In the literature, the maximum number of levels is 
obtained simply by fixing t 1 and then subdividing all nodes until all leaf nodes have no more than t points. 
For a non-adaptive FMM, where all of the leaf nodes have the same depth, it is possible to find one optimal 
subdividing threshold t [22]. However, for the adaptive case this cannot be optimal. For example, in Figure 
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11 a one-dimensional general Cantor set is partially illustrated. Unlike the uniform case, the leaf nodes have 


different depths, and are sparsely distributed. 

Let’s consider all the operators involved in the FMM. Subdividing a node results in the following. If the 
node is at level / max , we increase M2L, M2M and L2L while reducing P2P. If the node is at a higher level 
in the tree, we also increase M2L, M2M and L2L, but then reduce the cost of the P2L and M2P operators. 
The cost of the P2L/M2P operator is not the same as P2P. Therefore, we should expect that the optimal 
threshold at Z max is different from the optimal threshold higher up in the tree. 



Figure 11: One-dimensional general Cantor set with 7 = 0.4. The figure on the right shows the entire tree. 
The middle figure in green is a zoomed-in copy of the right side of the tree. The left figure in magenta is 
zoomed-in again, showing the “second main branch” in the green tree. This figure shows the adaptive nature 
of the FMM tree for a fractal set. 

The key point is that the M2L operator, depending on the implementation, is relatively cheap compared 
to P2P. The M2L operator is precomputed, whereas in P2P calculation we need kernel evaluation. M2L can 
also be accelerated (not done in this paper though) using a singular value decomposition (in a general case), 
or a fast Fourier transform using uniformly distributed points instead of Chebyshev pointsj^] As a result, the 
optimal threshold at Z max must be small, meaning that it is smaller than the number of Chebyshev nodes 
used for the multipoles and locals. 

In an M2P operator, we need to compute the interaction between particles and Chebyshev points. There¬ 
fore, we expect M2P (resp. P2L) to be more expensive than P2P for the same threshold. The threshold at 
l < Imax should consequently be lower than the threshold at Z max . 

With this insight, we decided to investigate the following subdivision rule. We choose Z max and t, and 
then we: 

subdivide a node iff: level of node < Z max and number of points in node > t 

This condition implies that it is possible for leaf nodes to have more than t points. However, all clusters 
that are not at the level Z max must have less than t points. Therefore, we have two parameters to tune in 
this approach. This is the new method we investigated. Given Z max , we define s(Z max ) to be the maximum 
number of points per leaf node. In Figure [l2j for the class of generalized Cantor sets, s(Z m ax) is plotted. 
In this case, s(Z max ) can be approximated by N/2 ldH , where dn is the Hausdorff dimension of the general 
Cantor set. 

In our approach we have two parameters to tune: t and Z max . The goal is to pick parameters such that the 
total cost of the calculation, COST(Z max , f), gets its optimal value. In the conventional subdividing scheme, 
the optimization is restricted only to the parameter pairs (Z max , s(Z max )). 

For the family of general Cantor sets, we investigated the total FMM cost for different pairs of parameters 
(Zmax, t). Decreasing the value of t , while Z max is constant, is similar to the concept of extended tree introduced 

2 There are stability and accuracy issues with this approach. These can be successfully solved but we won’t discuss this point 
in this paper. 
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Figure 12: s(Z max ) (the maximum number of particles per leaf in a tree with at most Z max levels) as a function 
of l ma x for different dn and N = 10 6 . 


in )J3] Essentially, by decreasing t we are substituting M2P and P2L operations, which are fairly expensive, 
by new M2L operations that are cheaper. It turns out that the total cost changes slightly at small values 
of t for different Z max . Therefore, settling ont= 1, and search for the optimal l max is a simple near-optimal 
choice. 

In Figure [l3j the total cost as a function of t for dn = 1 and d H = 2 is shown. To obtain these plots, 
at first, with t = 1 we found the optimal maximum depth Z max . Then for the optimal Z max , we changed the 
subdividing threshold from t = 1 to t = s(Z max ). We see that the choice of threshold t = 1 is very close to 
the optimum. This is the case for different values of dn ■ In this figure, we observe the rapid increase in M2P 
and P2L as t increases. This agrees with our analysis that small values of t are preferable. 



Figure 13: The adaptive FMM total cost for N = 10 6 particles with general Cantor set distributions. We 
picked l op t using the new subdividing method. Each sub-algorithm cost is shown as t increases from 1 to 
s(^max)- Left: dn = 1- Right: dn = 2. This figure shows that small values of t are preferable. This is caused 
by the rapid increase in the cost of M2P and P2L. 

In Figure [l4j we have plotted the optimal total cost as a function of dn for the conventional and new 
subdividing schemes. Note that the total cost function takes its optimal value at different /max’s for each 
scheme. Observe that the new and the conventional threshold schemes become the same as dn —)■ 3, since 
in that case all leaves are lying in the finest level. 

From Figure[l3j we see that for t = 1, the major part of the total cost is due to M2L, P2P, P2M, and L2P 
operators. Since N and r are fixed, the P2M and L2P costs are fixed. In the next part, we have introduced 
an ad hoc analysis to find the relation between the computational cost and dn, keeping everything else fixed 
(i.e., N, r, the computer hardware, etc). 
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Figure 14: Comparison between the optimal cost of the conventional subdividing method and the new 
method. We note that a greater improvement would be obtained with our new method if we had implemented 
an optimized M2L operator. For cLh = 3, both methods become identical since all leaf nodes are at level 

^max- 


6.2 Heuristic optimization analysis 

In §6.1[ we showed that the M2L and P2P operators are the key operators that determine the total cost, 
while other operators have either constant or small cost. In this section by total cost we refer to the sum of 
the cost of P2P and M2L operations. 

Let’s begin with the following assumptions using the definition of occupancy (which happens to be the 
same as Hausdorff dimension for the family of fractal sets that we are studying). Assume that the maximum 
depth of the tree is l. 

• Number of occupied nodes with depth i ~ 2 dHl 

• Number of leaves ~ 2 dnl 

• Number of particles per leaf ~ N/2 dirl 

• Number of P2P interactions per leaf ~ 3 d - 

• Number of M2L interactions per node ~ 6 d - — 3 d - 

• Total number of occupied nodes in the tree ~ (2 dff )° + (2 dH ) 1 + ( 2 d -) 2 + ... + ( 2 dn ) l = *- 2 _ 1 " 1 

Based on the above assumptions, the P2P+M2L cost is: 

at o dnl+dn _ i 

COST ~ — + p 2dH _ 1 (6 dH - 3 dH ) (9) 


where, a and (3 depend on other parameters, such as machine architecture, low rank approximation, etc. We 
are going to optimize the above function with respect to l. Set f|COST = 0: 


, lo g 2 (f) , log 2 N 1 

lopt ^ 2 d H + d H 2 

Plug 1 0 pt in Equation (|9| to obtain COST opt : 

COST opt ~ 3 d " (2^/cxf3N2 dH / 2 - p) 


( 10 ) 


( 11 ) 


Note that for l = l opt we have COSTp 2 p — COST^/[ 2 l- Fot large values of N, we can ignore the second 
term in the Equation (11), and rewrite it as: COST opt ~ 2 dH ^ 2 3 dH N, which is consistent with our linear 
complexity expectation. From this heuristic analysis, we can conclude the following models. 
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Model of the optimum tree height: 

fractal sets is of the form 


l 


opt — 


The optimum height of the tree in the adaptive FMM with 
Ki + \og 2 (N) 


where K\ and /v 2 are functions of other parameters. 

Model of the optimum cost: The optimum P2P+M2L cost in the adaptive FMM with fractal sets is 
of the form 

log(COST opt ) ~ K 3 + log (N) + 1.445 d H , 


where K$ depends on the other parameters. 

In Figure |15[ we have verified the above models, for generalized Cantor sets with Hausdorff dimension 
varying from dn = 1 to dn = 3. The predicted behavior was obtained with good accuracy. The least square 
line in the right plot of Figure [l5| has slope of 1.2 which is close to the analytical prediction log(3)+log(2)/2 = 
1.44 (this gets better for larger values of N). Also this plot shows that the worst case in the adaptive FMM is 
when the tree is full. It is consistent with the analysis presented in ([3] where we introduced linear complexity 
for adaptive FMM. 

In fj5]we showed that the average number of children of a node is what determines the optimal parameters 
and optimal cost. This can be formalized by the Hausdorff dimension. Note that even with a finite number 
of points, the concept of average number of child nodes is well-defined, and can be used to tune the FMM 
using the above models. 
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Figure 15: Numerical verification for the heuristic analysis with dn varying from 1 to 3 for N = 10 6 . Left: 
variation of l op t as a function of 1/dn- Right: The optimum cost for the new proposed subdividing scheme. 
Almost linear variation of log(COST opt ) as a function of dn is shown. 


7 Conclusion 

In this paper, we provided a new proof of the linear complexity of the FMM in the general case. Note that 
our proof focuses on the cost of the FMM operators and does not consider the cost of building the FMM tree. 
We introduced the required modifications to the regular adaptive FMM in order to achieve O(N) complexity 
for arbitrary point distributions. The standard adaptive FMM may require arbitrarily long calculations for 
certain point distributions. 

In order to study different adaptive FMM cases, we defined a process to properly add points and having 
N —► 00 . For this, we need to introduce the concept of super octree T (an octree with an infinite number 
of levels) and establish a connection between a set of points and its associated super octree. This represents 
a significant extension of previous studies of the adaptive FMM, which are limited to a narrower range of 
types of distributions. We then focused on the fractal dimension of sets, e.g., the Hausdorff dimension, as a 
useful parameter to understand the complexity of the adaptive FMM and determine optimal parameters for 
the FMM. 
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We showed that the usual criterion to determine whether a node should be subdivided or not, based on a 
threshold for the number of points in leaf nodes, is insufficient. Indeed, using a single threshold t throughout 
the tree is sub-optimal. At ^ max , the optimal t is essentially found by finding a balance between P2P and 
M2L. Instead, at lower levels in the tree, the operators M2P and P2L need to be taken into account and 
those are typically significantly more expensive than the M2L operator. As a result, the threshold for leaves 
at l max should be chosen differently from the threshold at lower depths. Consequently, we considered an 
optimization procedure with two parameters: the maximum number of levels / max and a threshold t. The 
conventional optimization with a single threshold t is strictly a special case of our approach. 

We introduced a near-optimal solution to this new optimization problem. Using generalized Cantor sets, 
we showed that the new scheme has better performance compared to the conventional threshold method. 
The improvement would be even more evident if we had implemented an improved version of the M2L 
operator (for example accelerated using FFTs). Moreover, our near-optimal solution shows that using X-list 
and W-list is not necessarily beneficial in the implementation of the adaptive FMM. 

Finally, we established how the tree occupancy, which is the same as the Hausdorff dimension in the case 
of self-similar sets, affects the optimum parameters in the adaptive FMM. Based on this, we provided an 
optimization model to find the number of levels in the tree and the cost of the total calculation, as a function 
of the dimension of the point distribution. 
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